ci: populate baselines and gate benchmarks at +20%#91
Conversation
Add check_benchmark_regression.py and reduce_baselines.py; populate benchmarks/baselines.json from post-cache run; wire regression gate into ubuntu benchmarks job; add Makefile update-baselines target and tests.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a benchmark regression gate to CI by introducing two new scripts ( ChangesBenchmark Regression Gate
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
scripts/check_benchmark_regression.py (2)
12-27: ⚡ Quick winConsider adding defensive error handling for malformed input.
Both
load_resultsandload_baseline_meansassume well-formed JSON with the expected schema (e.g.,data["benchmarks"],entry["stats"]["mean"]). Whilepytest-benchmarkshould always produce correct output, missing keys or malformed JSON would raise unhelpfulKeyErrororTypeErrorexceptions in CI.Adding try/except blocks with actionable error messages would improve debuggability when investigating CI failures.
🛡️ Proposed defensive improvements
def load_results(results_path: str | Path) -> dict[str, float]: - data = json.loads(Path(results_path).read_text(encoding="utf-8")) - return {entry["name"]: float(entry["stats"]["mean"]) for entry in data["benchmarks"]} + try: + data = json.loads(Path(results_path).read_text(encoding="utf-8")) + return {entry["name"]: float(entry["stats"]["mean"]) for entry in data["benchmarks"]} + except (KeyError, TypeError) as e: + print(f"ERROR: malformed results JSON at {results_path}: {e}", file=sys.stderr) + raiseSimilar pattern for
load_baseline_means.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/check_benchmark_regression.py` around lines 12 - 27, Both load_results and load_baseline_means functions lack error handling for malformed JSON input and missing expected keys. Add try/except blocks around the JSON parsing and data access logic in both functions to catch KeyError and TypeError exceptions. When these exceptions occur, raise a new exception with an actionable error message that includes the file path and details about which expected key or structure was missing. This will provide meaningful error context instead of unhelpful KeyError exceptions when CI receives malformed JSON.
45-45: ⚡ Quick winAdd guard against division by zero.
If a baseline mean is
0, line 45 will raiseZeroDivisionError. While benchmarks with zero mean are implausible, defensive validation would prevent cryptic CI failures.🛡️ Proposed guard
+ if base == 0: + print(f"WARN: baseline {name!r} has zero mean; skipping") + continue ratio = cur / base🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/check_benchmark_regression.py` at line 45, The ratio calculation at the line where ratio is assigned as cur divided by base lacks protection against division by zero. Add a guard condition to check if the base value is zero before performing the division operation. If base is zero, handle it appropriately by either skipping the ratio calculation, assigning a special value, or logging a warning. This defensive check will prevent ZeroDivisionError exceptions even though zero baseline means are unlikely in practice.scripts/reduce_baselines.py (1)
44-49: ⚡ Quick winValidate
--slackparameter bounds.The
--slackargument accepts any float, including negative or zero values, which would produce incorrect baseline means. Adding validation would prevent user errors.♻️ Proposed validation
parser.add_argument( "--slack", type=float, default=1.0, help="multiply means by this factor (e.g. 1.25 when capturing on a faster host)", ) args = parser.parse_args(argv) + if args.slack <= 0: + parser.error("--slack must be positive") reduce_baselines(args.raw_path, args.out_path, slack=args.slack)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/reduce_baselines.py` around lines 44 - 49, The `--slack` argument parser definition lacks validation to ensure the parameter value is valid for use as a multiplier. Add a constraint to the `--slack` argument definition to validate that the value is greater than zero, rejecting negative or zero values that would produce incorrect baseline means. You can use either the `choices` parameter with a custom validation function or leverage the `type` parameter with a callable validator to enforce this constraint.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/baselines.json`:
- Line 4: The "machine" field in benchmarks/baselines.json indicates the
baselines were captured on Windows, but the CI gates require ubuntu-latest
baselines to prevent cross-OS timing variance from causing false
positives/negatives. Recapture the baselines on an ubuntu environment by either
running "make update-baselines" after the ubuntu environment is available, or
manually running pytest with "--benchmark-only
--benchmark-json=benchmarks/_raw.json" flags followed by executing the
scripts/reduce_baselines.py script to generate the correct ubuntu-based
baselines.json file.
---
Nitpick comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 12-27: Both load_results and load_baseline_means functions lack
error handling for malformed JSON input and missing expected keys. Add
try/except blocks around the JSON parsing and data access logic in both
functions to catch KeyError and TypeError exceptions. When these exceptions
occur, raise a new exception with an actionable error message that includes the
file path and details about which expected key or structure was missing. This
will provide meaningful error context instead of unhelpful KeyError exceptions
when CI receives malformed JSON.
- Line 45: The ratio calculation at the line where ratio is assigned as cur
divided by base lacks protection against division by zero. Add a guard condition
to check if the base value is zero before performing the division operation. If
base is zero, handle it appropriately by either skipping the ratio calculation,
assigning a special value, or logging a warning. This defensive check will
prevent ZeroDivisionError exceptions even though zero baseline means are
unlikely in practice.
In `@scripts/reduce_baselines.py`:
- Around line 44-49: The `--slack` argument parser definition lacks validation
to ensure the parameter value is valid for use as a multiplier. Add a constraint
to the `--slack` argument definition to validate that the value is greater than
zero, rejecting negative or zero values that would produce incorrect baseline
means. You can use either the `choices` parameter with a custom validation
function or leverage the `type` parameter with a callable validator to enforce
this constraint.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 575f2d48-af8a-4cd9-9998-d03f8ce771b6
📒 Files selected for processing (8)
.github/workflows/ci.yml.gitignoreMakefilebenchmarks/README.mdbenchmarks/baselines.jsonscripts/check_benchmark_regression.pyscripts/reduce_baselines.pytests/test_check_benchmark_regression.py
Raise BenchmarkDataError for malformed JSON or missing keys; skip zero baselines safely; reject non-positive --slack values.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/test_check_benchmark_regression.py (1)
65-77: ⚡ Quick winAdd an explicit boundary test for exactly +20% (ratio == 1.20).
Current coverage checks above and below threshold, but not the exact cutoff. Since gating is defined as
current / baseline > 1.20, add a case (e.g.,0.00012vs0.0001) asserting exit code0to lock in boundary semantics.Also applies to: 80-107
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_check_benchmark_regression.py` around lines 65 - 77, The current test coverage for the check_regression function does not include an explicit test for the exact boundary condition where the performance ratio equals 1.20 (the threshold). Add a new test function that verifies the exact boundary case, such as using a current value of 0.00012 against a baseline of 0.0001, which produces exactly a 1.20 ratio, and assert that check_regression returns 0 (passes). This locks in the boundary semantics and ensures the gating logic defined as current / baseline > 1.20 is correct at the exact cutoff point.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/test_check_benchmark_regression.py`:
- Around line 65-77: The current test coverage for the check_regression function
does not include an explicit test for the exact boundary condition where the
performance ratio equals 1.20 (the threshold). Add a new test function that
verifies the exact boundary case, such as using a current value of 0.00012
against a baseline of 0.0001, which produces exactly a 1.20 ratio, and assert
that check_regression returns 0 (passes). This locks in the boundary semantics
and ensures the gating logic defined as current / baseline > 1.20 is correct at
the exact cutoff point.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e7390034-4231-4d61-b70c-0a328b1bbef3
📒 Files selected for processing (3)
scripts/check_benchmark_regression.pyscripts/reduce_baselines.pytests/test_check_benchmark_regression.py
🚧 Files skipped from review as they are similar to previous changes (2)
- scripts/reduce_baselines.py
- scripts/check_benchmark_regression.py
|
@CodeRabbit, please review this PR in more depth |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
echo "=== scripts/check_benchmark_regression.py ==="
cat scripts/check_benchmark_regression.py
echo ""
echo "=== scripts/reduce_baselines.py ==="
cat scripts/reduce_baselines.py
echo ""
echo "=== tests/test_check_benchmark_regression.py ==="
cat tests/test_check_benchmark_regression.py
echo ""
echo "=== benchmarks/baselines.json ==="
cat benchmarks/baselines.json
echo ""
echo "=== Makefile ==="
cat Makefile
echo ""
echo "=== .github/workflows/ci.yml ==="
cat .github/workflows/ci.yml
echo ""
echo "=== benchmarks/README.md ==="
cat benchmarks/README.md
echo ""
echo "=== .gitignore (last 20 lines) ==="
tail -20 .gitignoreLength of output: 21437
Deep Review: PR
|
| Scenario | Expected |
|---|---|
Valid raw JSON → correct baselines.json structure |
Pass |
Missing "benchmarks" key |
BenchmarkDataError (once added) |
--slack 0 or --slack -1 |
argparse.ArgumentTypeError |
| Non-gated group entries filtered out | Absent from output |
machine_info present → machine field populated |
Populated correctly |
6. Missing test for the exact 1.20× boundary
The gate condition is ratio > threshold (strict), so a ratio of exactly 1.20 must pass. This is not tested:
def test_exactly_at_threshold_passes(tmp_path) -> None:
results = tmp_path / "results.json"
baselines = tmp_path / "baselines.json"
_write_results(results, [{"name": "test_parse_session_small", "stats": {"mean": 0.000120}}])
_write_baselines(baselines, {"parse": {"test_parse_session_small": 0.0001}})
assert check_regression(results, baselines) == 0 # 1.20x is not > 1.207. Missing test for "no current result for baseline" warning path
test_missing_baseline_warns_without_failing tests the case where a result has no matching baseline (WARN path). The opposite case — baseline entry exists but result is absent — is never tested:
def test_missing_current_result_warns_without_failing(tmp_path, capsys) -> None:
results = tmp_path / "results.json"
baselines = tmp_path / "baselines.json"
_write_results(results, []) # empty
_write_baselines(baselines, {"parse": {"test_parse_session_small": 0.0001}})
assert check_regression(results, baselines) == 0
assert "no current result for baseline" in capsys.readouterr().out8. main() does not expose --threshold as a CLI argument
The THRESHOLD = 1.20 constant is not overridable from the command line. For a tool designed to be run in CI, this makes it harder to locally test different thresholds or temporarily loosen the gate for a branch:
parser.add_argument("--threshold", type=float, default=THRESHOLD,
help="regression ratio threshold (default: 1.20)")🟢 Low / Informational
9. baselines.json — Timestamp appears hand-crafted
The "updated" field reads "2026-06-17T21:00:00Z" — a suspiciously round timestamp. reduce_baselines.py uses datetime.now(UTC) which produces sub-second precision. If this was captured by a real CI run, the timestamp should look like "2026-06-17T21:03:47Z". If the values were manually authored, they may not reflect actual post-cache performance on ubuntu-latest.
10. benchmarks/README.md — Misleading phrasing
"Do not capture baselines on Windows/macOS for commit — ubuntu runners are slower and the gate will fail."
"ubuntu runners are slower" contradicts the intent. The actual reason is that baselines must match the gated runner OS (ubuntu-latest). Consider:
"Baselines must be captured on ubuntu-latest to match the CI runner. Cross-OS variance will cause spurious failures."
11. Makefile — No clean target
After running make update-baselines or make check-benchmarks, benchmarks/_raw.json and benchmark-results.json are left on disk (both in .gitignore). A clean target would be ergonomic:
clean:
rm -f benchmarks/_raw.json benchmark-results.jsonSummary table
| # | Severity | File | Issue |
|---|---|---|---|
| 1 | 🔴 | ci.yml |
Artifact upload skipped when regression gate fails |
| 2 | 🔴 | reduce_baselines.py |
No error handling on malformed input |
| 3 | 🔴 | check_benchmark_regression.py |
BenchmarkDataError uncaught in main() |
| 4 | 🟡 | check_benchmark_regression.py |
load_baseline_means fallback silently discards data on missing groups key |
| 5 | 🟡 | tests/ |
No tests for reduce_baselines.py |
| 6 | 🟡 | tests/ |
No test for exact 1.20× boundary |
| 7 | 🟡 | tests/ |
No test for missing-current-result WARN path |
| 8 | 🟡 | check_benchmark_regression.py |
--threshold not exposed as CLI argument |
| 9 | 🟢 | baselines.json |
Timestamp looks hand-crafted, not from real CI run |
| 10 | 🟢 | benchmarks/README.md |
Misleading "ubuntu runners are slower" phrasing |
| 11 | 🟢 | Makefile |
No clean target |
The most impactful fix to make immediately is #1 (if: always() on the artifact upload step) — losing the benchmark JSON on a regression failure defeats the purpose of the gate.
Upload benchmark-results.json even when the gate fails; add BenchmarkDataError handling to reduce_baselines; require explicit groups key; expose --threshold CLI flag; add reduce_baselines and boundary tests; clarify README.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
scripts/check_benchmark_regression.py (1)
34-41:⚠️ Potential issue | 🟠 Major | ⚡ Quick winWrap
stats.meannumeric conversion inBenchmarkDataError.At Line 41,
float(mean)can raiseValueError/TypeErrorand escape uncaught, causing a traceback instead of the CLI’s structured error path (ERROR:+ exit code2).Suggested fix
try: name = entry["name"] mean = entry["stats"]["mean"] except (KeyError, TypeError) as exc: raise BenchmarkDataError( f"{path} benchmarks[{index}] missing 'name' or 'stats.mean'" ) from exc - results[str(name)] = float(mean) + try: + results[str(name)] = float(mean) + except (TypeError, ValueError) as exc: + raise BenchmarkDataError( + f"{path} benchmarks[{index}] has non-numeric 'stats.mean'" + ) from exc🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/check_benchmark_regression.py` around lines 34 - 41, The conversion of mean to float using float(mean) on line 41 can raise ValueError or TypeError exceptions that are not caught by the existing try-except block, causing unhandled exceptions instead of properly formatted CLI errors. Move the results assignment statement containing float(mean) inside the try block (before the except clause) so that any conversion errors are caught and wrapped in BenchmarkDataError along with the other data validation errors.scripts/reduce_baselines.py (1)
58-63:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winValidate
machine_infoshape before calling.get().If
machine_infois present but not an object,machine_info.get("system")raisesAttributeErrorand bypasses your structured error handling.Suggested fix
- machine_info = raw.get("machine_info", {}) + machine_info = raw.get("machine_info", {}) + if machine_info is not None and not isinstance(machine_info, dict): + raise BenchmarkDataError(f"{path} 'machine_info' must be an object when present") @@ - "machine": machine_info.get("system"), + "machine": machine_info.get("system") if isinstance(machine_info, dict) else None,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/reduce_baselines.py` around lines 58 - 63, The code calls machine_info.get("system") without validating that machine_info is actually a dictionary, which will raise an AttributeError if machine_info is not a dict object. Before calling .get() on machine_info, add a type check to ensure it is a dictionary instance, and provide a safe fallback value (such as None) if the validation fails. This will ensure the structured error handling is not bypassed by unexpected data types in the machine_info field.
🧹 Nitpick comments (1)
tests/test_reduce_baselines.py (1)
77-78: ⚡ Quick winAssert the
SystemExitcode for stricter CLI validation.Line 77 currently accepts any
SystemExit, which can hide unrelated parse failures. Assertexc.value.code == 2so the test specifically verifies argparse rejection for non-positive--slack.Proposed test tightening
- with pytest.raises(SystemExit): - main([str(raw), str(tmp_path / "out.json"), "--slack", "0"]) + with pytest.raises(SystemExit) as exc: + main([str(raw), str(tmp_path / "out.json"), "--slack", "0"]) + assert exc.value.code == 2🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_reduce_baselines.py` around lines 77 - 78, The pytest.raises(SystemExit) on the main function call with the `--slack` argument currently accepts any SystemExit without validating the specific exit code. Modify the context manager to capture the exception info using `as exc_info` and then assert that `exc_info.value.code == 2` after the main function call completes, ensuring the test strictly validates that argparse is rejecting the non-positive slack value (exit code 2) rather than failing for unrelated reasons.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/reduce_baselines.py`:
- Around line 29-33: The code catches json.JSONDecodeError but does not handle
OSError that can be raised by path.read_text() and path.write_text() methods,
allowing raw tracebacks to leak for common CLI failures like missing files or
permission issues. Wrap path.read_text() calls in a try-except block that
catches both json.JSONDecodeError and OSError, converting OSError to
BenchmarkDataError with the error details included in the message. Apply the
same pattern to all path.write_text() calls in the script (at lines 65-66 and
81-85) to ensure consistent error handling throughout and prevent
filesystem-related exceptions from propagating as raw tracebacks.
---
Outside diff comments:
In `@scripts/check_benchmark_regression.py`:
- Around line 34-41: The conversion of mean to float using float(mean) on line
41 can raise ValueError or TypeError exceptions that are not caught by the
existing try-except block, causing unhandled exceptions instead of properly
formatted CLI errors. Move the results assignment statement containing
float(mean) inside the try block (before the except clause) so that any
conversion errors are caught and wrapped in BenchmarkDataError along with the
other data validation errors.
In `@scripts/reduce_baselines.py`:
- Around line 58-63: The code calls machine_info.get("system") without
validating that machine_info is actually a dictionary, which will raise an
AttributeError if machine_info is not a dict object. Before calling .get() on
machine_info, add a type check to ensure it is a dictionary instance, and
provide a safe fallback value (such as None) if the validation fails. This will
ensure the structured error handling is not bypassed by unexpected data types in
the machine_info field.
---
Nitpick comments:
In `@tests/test_reduce_baselines.py`:
- Around line 77-78: The pytest.raises(SystemExit) on the main function call
with the `--slack` argument currently accepts any SystemExit without validating
the specific exit code. Modify the context manager to capture the exception info
using `as exc_info` and then assert that `exc_info.value.code == 2` after the
main function call completes, ensuring the test strictly validates that argparse
is rejecting the non-positive slack value (exit code 2) rather than failing for
unrelated reasons.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e6f0d19f-683d-41e8-a1eb-2f97136d0bc1
📒 Files selected for processing (7)
.github/workflows/ci.ymlMakefilebenchmarks/README.mdscripts/check_benchmark_regression.pyscripts/reduce_baselines.pytests/test_check_benchmark_regression.pytests/test_reduce_baselines.py
✅ Files skipped from review due to trivial changes (1)
- benchmarks/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
- Makefile
- .github/workflows/ci.yml
Wrap reduce_baselines I/O in BenchmarkDataError; guard machine_info type; move float(mean) into try block in load_results; assert argparse exit code 2.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/reduce_baselines.py (1)
11-11:⚠️ Potential issue | 🟠 MajorFix import mode mismatch at Line 11 to prevent startup crashes.
from scripts.check_benchmark_regression import BenchmarkDataErrorfails withModuleNotFoundError: No module named 'scripts'when this file is run aspython scripts/reduce_baselines.py. Thescriptsdirectory lacks an__init__.pyfile and the absolute import does not work in direct script execution mode, despite this being the documented invocation pattern in Makefile and benchmarks/README.md.Suggested fix
-from scripts.check_benchmark_regression import BenchmarkDataError +if __package__: + from .check_benchmark_regression import BenchmarkDataError +else: + from check_benchmark_regression import BenchmarkDataError🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/reduce_baselines.py` at line 11, The absolute import of BenchmarkDataError from scripts.check_benchmark_regression fails when reduce_baselines.py is run directly as a script because the scripts directory is not a Python package. Change the import statement from the absolute import `from scripts.check_benchmark_regression import BenchmarkDataError` to a relative import `from .check_benchmark_regression import BenchmarkDataError` since both files are in the same scripts directory. This will allow the module to be correctly resolved when the file is executed as a script.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@scripts/reduce_baselines.py`:
- Line 11: The absolute import of BenchmarkDataError from
scripts.check_benchmark_regression fails when reduce_baselines.py is run
directly as a script because the scripts directory is not a Python package.
Change the import statement from the absolute import `from
scripts.check_benchmark_regression import BenchmarkDataError` to a relative
import `from .check_benchmark_regression import BenchmarkDataError` since both
files are in the same scripts directory. This will allow the module to be
correctly resolved when the file is executed as a script.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f630845d-4f13-4730-8970-fe8278b98d3c
📒 Files selected for processing (3)
scripts/check_benchmark_regression.pyscripts/reduce_baselines.pytests/test_reduce_baselines.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/test_reduce_baselines.py
- scripts/check_benchmark_regression.py
Drop test_parse_session_small and test_search_full_corpus from gated baselines; clarify ubuntu CI provenance in _note; detect duplicate benchmark names; expand reduce_baselines tests.
d8ce6cf to
001c5f0
Compare
|
@CodeRabbit, can you verify if @timon0305 's all comments are addressed? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
echo "=== .github/workflows/ci.yml ==="
cat .github/workflows/ci.yml
echo ""
echo "=== scripts/check_benchmark_regression.py ==="
cat scripts/check_benchmark_regression.py
echo ""
echo "=== scripts/reduce_baselines.py ==="
cat scripts/reduce_baselines.py
echo ""
echo "=== tests/test_check_benchmark_regression.py ==="
cat tests/test_check_benchmark_regression.py
echo ""
echo "=== tests/test_reduce_baselines.py ==="
cat tests/test_reduce_baselines.py
echo ""
echo "=== Makefile ==="
cat Makefile
echo ""
echo "=== benchmarks/README.md ==="
cat benchmarks/README.md
echo ""
echo "=== benchmarks/baselines.json ==="
cat benchmarks/baselines.jsonLength of output: 30061
Summary: All 3 critical 🔴 issues and all 5 medium 🟡 issues are fully addressed. 2 of 3 low/informational 🟢 issues are also resolved. The only remaining item is Issue |
Closes #83
Summary
Adds a +20% benchmark regression gate to CI, populates
benchmarks/baselines.jsonfrom a post-cache run, and documents how to refresh baselines.Sprint item #6 (3 pt) — Wednesday PR 2 of 2. Depends on PR 1 (#90 session cache) so baselines reflect cached-path performance.
Problem
The CI
benchmarksjob ranpytest tests/benchmarks/ --benchmark-onlyand uploaded artifacts, but was labeled informational with no threshold.benchmarks/baselines.jsonhad empty groups, so parse/export/search regressions passed silently.Changes
scripts/check_benchmark_regression.py(new)benchmarks/baselines.jsoncurrent_mean / baseline_mean > 1.20scripts/reduce_baselines.py(new)--benchmark-jsonoutput intobaselines.json(parse / export / search groups only)--slackfactor for cross-machine capture (e.g. faster local host vs ubuntu CI)benchmarks/baselines.jsoncachegroup benchmarks are informational only (not gated)CI (
.github/workflows/ci.yml)Performance benchmarks (informational)→Performance benchmarks (gated)python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.jsonMakefile(new)make update-baselines— run benchmarks and regeneratebaselines.jsonmake check-benchmarks— local gate checkTests
tests/test_check_benchmark_regression.py— missing-baseline warn (pass), >20% regression (fail), within threshold (pass)Docs / hygiene
benchmarks/README.md— gate behavior and baseline refresh instructions.gitignore—benchmark-results.json,benchmarks/_raw.jsonAcceptance criteria
benchmarks/baselines.jsonpopulated from post-cache runmake update-baselinesregenerates baselinesTest plan
pytest -q— 388 passedpytest tests/test_check_benchmark_regression.py -qpython scripts/check_benchmark_regression.py benchmarks/_raw.json benchmarks/baselines.jsonruff check .Notes
--slack 1.35for ubuntu headroom. If the gated CI job fails on first run due to runner variance, refresh on ubuntu viamake update-baselinesand recommit.Summary by CodeRabbit